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ABSTRACT 

MicroRNAs (miRNAs) regulate genes post tran- 
scription by pairing with messenger RNA (mRNA). 
Variants such as single nucleotide polymorphisms 
(SNPs) in miRNA regulatory regions might result in 
altered protein levels and disease. Genome-wide 
association studies (GWAS) aim at identifying 
genomic regions that contain variants associated 
with disease, but lack tools for finding causative 
variants. We present a computational tool that can 
help identifying SNPs associated with diseases, by 
focusing on SNPs affecting miRNA-regulation of 
genes. The tool predicts the effects of SNPs in 
miRNA target sites and uses linkage disequilibrium 
to map these miRNA-related variants to SNPs of 
interest in GWAS. We compared our predicted SNP 
effects in miRNA target sites with measured SNP 
effects from allelic imbalance sequencing. Our pre- 
dictions fit measured effects better than effects 
based on differences in free energy or differences 
of TargetScan context scores. We also used our 
tool to analyse data from published breast cancer 
and Parkinson's disease GWAS and significant 
trait-associated SNPs from the NHGRI GWAS 
Catalog. A database of predicted SNP effects 
is available at http://www.bigr.medisin.ntnu.no/ 
mirsnpscore/. The database is based on haplotype 
data from the CEU HapMap population and miRNAs 
from miRBase 16.0. 

INTRODUCTION 

MicroRNAs (miRNAs) are small non-coding single 
stranded RNAs of about 22 nucleotides length that 
regulate genes post transcription by partially pairing 
with 3'-untranslated regions (3'-UTR) of messenger 



RNA (mRNA) (1). Watson-Crick pairing to nucleotides 
2-7 of the 5'-end of microRNAs (seed sites) is known to 
be important in mRNA targeting. Specifically, miRNAs 
require almost perfect complementarity at seed sites for 
binding and reducing the protein levels of targets (2). 
However, mRNA sites with perfect complementarity to 
the seed nucleotides are not necessarily functional (3) 
and those with imperfect seed complementarity can also 
be functional (2). Consequently, considering seed sites 
alone gives many false positive miRNA target sites. 
Predictions can be improved, however, by using informa- 
tion about the target sites' context, such as their position 
within the 3'-UTR (4) and the distance to neighbouring 
sites (5), as such context is critical for target site function- 
ality and efficacy. 

Genome-wide association studies (GWAS) can identify 
genomic regions that contain genomic alterations, such as 
single nucleotide polymorphisms (SNPs), associated with 
common disease (6). The biological effects of identified 
alterations are usually not known, however, as few of 
the functional variants that show association in GWAS 
change the amino acid sequence. Moreover, a sizeable 
proportion is thought to reside in regulatory regions, 
since several associated regions found in GWAS lack 
known genes (7). Variants in regulatory regions can, for 
example, result in altered protein levels, so identifying and 
understanding their effects can improve diagnostics and 
treatments for diseases (8). Specifically, SNPs in regula- 
tory elements such as miRNA target sites can affect 
phenotype (9) and have been associated with increased 
cancer risk (10) and other diseases (11). The increased 
use of GWAS to study genetic factors in common 
disease necessitates a tool that can identify and interpret 
effects of regulatory variants. 

Several research groups have tried to look at regulatory 
variant effects. Bao et al. (12) looked for SNPs in putative 
conserved miRNA target sites [from the target site predic- 
tion tool TargetScan (13)], and integrated such SNP sites 
with phenotype (physiological and behavioural traits 
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Figure 1. Identifying SNPs in miRNA target sites. The illustration 
shows an mRNA region that contains SNPs represented by small 
vertical lines. The considered SNP has two alleles: A and G. We 
make one subsequence for each allele by using the flanking regions 
of the SNP (7 nucleotides on each side). Given miRNA seed motifs 
(nucleotides 2-8 from the 5'-end of miRNA sequences), we look for 
target sites in each allele sequence and then compare results to charac- 
terise the effect of the SNP (create/delete (CRT/DEL) target sites, or 
change (CHG) site type). 



for targeting by a microRNA seed motif m. Specifically, 
for each allele a h we determine whether there is a 
microRNA target site in a sequence als t consisting of the 
allele a t and its flanking sequences. Target sites are 
detected by using any miRNA target site prediction tool 
based on sequence search. It is convenient to disregard 
target sites with mismatches in the seed region and only 
consider 6-mer, 7-mer and 8-mer seed sites. For each 
allelic sequence als h we get a list /,- of target sites for 
microRNA m. We can then compare these lists to deter- 
mine if a target site is created, deleted, or changed between 
the alleles (Figure 1). 

All existing tools use variants of the approach above 
of evaluating candidate sites individually (Figure 1), but 
this approach ignores that 3'-UTRs can contain multiple 
linked SNPs that can affect miRNA targeting by altering 
site context. Instead, we propose to analyse all the SNPs 
of the 3'-UTR at the same time, to have a general overview 
of the SNPs' regulatory effect on the considered mRNA. 

In this article, we present a computational tool that 
can help identifying SNPs causative to diseases, such as 
cancer. The tool focuses on SNPs that may affect miRNA 
targeting and thereby cause gene dysregulation. More pre- 
cisely, the tool predicts the effects of SNPs in miRNA 
target sites and uses linkage disequilibrium to map those 
mirSNPs to SNPs of interest in GWAS. We show that the 
tool's predictions correspond well to the SNP's measured 
effects on miRNA regulation, and that the predictions 
correlate better to those effects than do the predictions 
of other existing tools. We further demonstrate the 
tool's utility by analysing two published GWAS data 
sets and specific SNPs reported to affect miRNA 
targeting. 



of mice as quantitative trait loci) and expression data 
(of mice and human transcripts) into a database. 
However, the studied phenotypes only concern physiology 
of mice instead of human diseases. Georges et al. (14) also 
made a database with SNPs in putative miRNA target 
sites [regulatory motifs identified in (15) and predicted 
sites from (13)], but Georges et al. (14) did not map 
their site SNPs to phenotypes, except for one SNP in 
sheep. Barenboim et al. (16) developed an online tool 
that finds SNPs in microRNA target sites on the fly. 
The tool takes haplotype into account, but is limited to 
one single gene and six SNPs per run and does not 
quantify SNP effects. Nicoloso et al. (17) used the 
miRanda tool (18) to identify breast cancer-associated 
SNPs that disrupt miRNA target sites. The authors 
filtered SNPs based on minimum free energy (MFE) and 
tested the remaining ones in a case-control study. 

A basic way of detecting SNPs in microRNA target 
sites (mirSNPs) in a gene g, starts by looking at SNPs 
lying in a region of interest, such as 3'-UTR, 5'-UTR, 
coding or promoter region (Figure 1). Here, we will use 
the 3'-UTR as an example, since SNPs affecting miRNA 
target sites are more likely to reside in the 3'-UTR (19,20). 
Let us consider a SNP s in this region of interest. The SNP 
s has several alleles, usually two, that we want to evaluate 



MATERIALS AND METHODS 

The following sections will present a method that uses 
context-based miRNA target prediction to quantify the 
effects of SNPs in miRNA target sites (mirSNPs) and 
uses linkage disequilibrium to map candidate mirSNPs 
to disease data from GWAS. The tool allows additional 
filtering of candidate genes and candidate miRNAs. 
The tool's mapping method is general and can therefore 
be applied to SNPs independent of the scoring method 
used. 

Data 

We used the SNP data from the human haplotype map 
project [HapMap, (21)]; particularly, SNP data from the 
CEU population (CEPH - Utah residents with ancestry 
from northern and western Europe), release 22 for haplo- 
type data, and release 27 for linkage disequilibrium data. 
We used DNA sequences from the human and mouse 
genome assemblies hgl8 and mm9 (22,23). SNPs and 
Gene annotations (hgl8,mm9) came from UCSC 
Genome browser (24). MicroRNA sequences came from 
miRBase, release 13.0 and 16.0 (25). GWAS data were 
from a breast cancer study from Cancer Genetic 
Markers of Susceptibility (CGEMS) (26), from a 
Parkinson disease study (P- values from tier 1) (27), and 
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from the NHGRI GWAS catalog (28) (http://www 
.genome.gov/gwastudies). 

MicroRNA regulation score of haplotypes 

To analyse all the SNPs of the 3'-UTR at the same time, 
we use population haplotype data for the 3'-UTR 
(Figure 2 and Supplementary Figure SI). Specifically, we 
first use haplotype data to build haplotype sequences hs{, 
i.e. 3'-UTR sequences containing the combinations of 
alleles found in the considered population. Second, for a 
given miRNA m, we use a miRNA target prediction tool 
(29) to score each haplotype sequence hs t . The prediction 
tool uses a two-step SVM classifier, where one SVM step 
classifies individual target sites and a subsequent SVM 
step classifies overall mRNA targeting potential. 
Features the SVM uses at the first step include seed 
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Figure 2. Scoring SNPs in miRNA target sites. rs3019 and rs2281627 
are SNPs in the 3'-UTR of TRIM32. There are 3 different haplotypes 
in the CEU population: UC/UU/CU. TRIM32 is targeted by miR-511, 
but the U allele of rs2281627 disrupts one seed site, which results in a 
lower score S 2 for the UU/CU haplotypes. To identify rs2281627 as the 
effect SNP, first the 3 haplotypes Hi, H 2 and H 3 are grouped by scores 
into Gj and G 2 - Second, we identify the differences between haplotypes 
from groups G\ and G2, i.e. differences between Hi and H2 and 
between H x and H 3 . Third, we cluster those haplotype differences, so 
that the intersection within the cluster is not empty; here, there is only 
one cluster. Finally, we take the intersection of haplotype differences 
within this cluster, which gives the SNP rs2281627. Similarly, rs61 14999 
and rs6132784 lie in the 3'-UTR of ACSS1. There are 3 haplotypes: 
GC/GU/AU. Both SNPs lie outside of any seed sites of miR-452, but 
rs6 132784 lies in a 3'-supplementary site and has a small effect on the 
scores. 



pairing, 3' supplementary pairing, the site's AU context 
and relative position in the 3'-UTR, and distance to neigh- 
bouring sites, whereas features at the second step include 
3'-UTR length, the number and predicted strength of 
target sites, and the number of optimally spaced sites in 
the 3'-UTR (29). As output, the SVM-based prediction 
tool gives a score such that a high output score indicates 
that the miRNA m is likely to down-regulate this mRNA. 
Third, we compare the score-haplotype pairs to find the 
differences of haplotypes that can explain any differences 
of SVM scores. From the differences of haplotypes, we 
can make a list of candidate SNPs and predict their 
impact on gene regulation. 

The haplotype score comparison works as follows. First 
we group haplotypes H, by scores, since we are interested 
in score differences: 

G s = {Hi e H\Score{Hi) = s}. 

Second, we look at the difference of haplotypes between 
groups, to identify which SNPs differ between two score 
groups: V(G,„, G„), m^n, W/,e G m , yH,-e G n , 

AHap/Ojj — {snp\Hi{snp) ^ Hj(snp)}. 

Third, we cluster the AHaplo SNP sets, to handle par- 
ticular cases such as two SNPs in one target site 
(Supplementary Figure S2). Specifically, we cluster 
AHaplo sets such that in each cluster, the intersection of 
all the AHaplOjj of the cluster is not empty: 

Clust k = {AHaplo y\ P| AHaplo , y ^ 0}. 

Fourth, we take the intersection of the AHaplo SNP sets 
in each cluster, to identify which SNP is responsible for 
the score difference in each cluster: 

Intersk = O Clustk = AHaplo^. 

AHaplOjj e Chist^ 

Finally, we merge all the clusters to create a list of SNPs 
responsible for the score difference for the clusters: 

Candidate m „ = \^Jlnterst- 

k 

Candidate,,,,, are candidate SNPs that might explain the 
difference between the scores m and 11. 

Normalization of target site scores 

The miRNA target site prediction tool (29) predicts both 
the targeting potential of individual candidate sites and 
the total regulatory potential of candidate 3'-UTRs; i.e. 
if a gene's 3'-UTR sequence contains one or more candi- 
date miRNA target sites, the tool scores the miRNA's 
regulatory effect on the target gene. However, the tool 
does not score mRNAs without target site candidates. 
Consequently, to score and compare scores for sequences 
with and without candidate sites, we needed to create a 
normalized score. The desired distribution should be 
mainly uniform, because the difference between two trans- 
formed scores should reflect a difference in percentiles in 
the original distribution. Since we only get scores for 
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sequences with target sites, we had to find a way to score 
sequences that do not have target sites and to com- 
pare sequences with and without target sites. Our 
solution consisted of normalizing the scores in the 
interval [0, 1]. As there are more sequences without 
target sites than with target sites, we normalized scores 
so that the codomain of the normalization has an expo- 
nential distribution in [0, 0.01] and a uniform distribution 
in [0.01, 1], according to the following probability density 
function: 



dfiy) 
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Here, x is the threshold that separates the two distribu- 
tions in the codomain. To jointly score sequences with and 
without target sites, we considered sequences with only 
one target site as an intermediate. Since we needed to 
put the worst target site scores in the exponential part, 
we used the score distribution of mRNAs that have only 
one target site, which is a 6-mer. Specifically, we used 
the fifth percentile of the 6-mer distribution to define 
the threshold T: P(X 6m <T) = 0.05. This threshold then 
separated the exponential distribution from the uniform 
distribution in the domain of the normalization 
morphism. As a result, the exponential part contained 
scores for sequences that have no target site (TS) 
(including those with mismatch target sites) or canonical 
target sites with a score lower than T. The proportion of 
scores that will be in the uniform part is Pumf = 
P[X> T\P TS , where P TS is the probability of having a 
target site and P[X> T] is the proportion of scores 
greater than T. The proportion of scores in the expo- 
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We chose x = 0.01 as a trade-off between x being so small 
that all the scores from the exponential part had the same 
tendency, and being so large that we could find the a that 
minimized /fa). 

Mapping candidate SNPs to disease 

We can map candidate mirSNPs to disease by filtering on 
genes that are dysregulated in a given disease, filtering 
on miRNAs that are dysregulated in a given disease, 
and filtering on disease-associated SNPs from the same 
genomic region as the candidate. As filtering on genes or 
miRNAs simply involves focusing on subsets of the UTRs 
or miRNAs, we detail the filtering on disease-associated 
SNPs. 

Association studies can show association of marker 
SNPs with a disease, but not necessarily association of a 
causal SNP with the disease. Consequently, if we want to 
know whether a candidate mirSNP may be causal, we first 
have to map it to associated marker SNPs. 



Mapping candidate SNPs to association studies consists 
in looking for GWAS top ranking SNPs that have been 
inherited together with our candidate SNPs; i.e. looking 
for candidate SNPs that have alleles that correlate with 
alleles of associated marker SNPs. This can be achieved by 
computing inheritance blocks. 

Inheritance blocks are DNA regions with highly 
correlated alleles. Consequently, by knowing the alleles 
of one SNP of the block one can predict the alleles at 
another SNP of the block. This measure of inheritance 
is called linkage disequilibrium (LD). Given a candidate 
SNP, we can compute its inheritance block, according to 
HapMap data. The block is an area of strong linkage 
disequilibrium and shows SNPs that have high correlation 
between themselves and with the candidate SNP. 

We can define a block as a set of successive SNPs: 



Block = [si, 



, Sri 



where si and s r are the left and right bound SNPs of the 
block. 

A block spine is a set of LD values: 
Spine = [D'y] U {D' ir }, 

such that / <j < r and l<i<r and where D' is the linkage 
disequilibrium between the SNPs s x and s r . In short, the 
spine consists of the borders of the block (the two borders 
of the triangle block). 

A solid spine is a spine where a relative amount a of 
the spine's LD values is below a threshold T. For example, 
we can use a = 10% and T = 0.8, to detect blocks with 
strong LD. 

The block detection method (Figure 3) is called 
Solid Spine by Expansion and is an adaptation of the 
Solid Spine algorithm developed within the Haploview 
software (30). This expansion algorithm uses a candidate 
SNP as input. It starts the expansion from this SNP and 
then tries to expand the block successively in the down- 
stream and upstream directions. An expansion occurs if 
the spine of the expanded block fits a rule depending on a 
and T. This algorithm needs an area of high LD to 
expand, which ensures that the algorithm returns few 
false positive blocks. The expansion can start on the left 
side as well as on the right side and the two directions can 
give different results. As we are interested in finding all 
SNPs that reside in blocks that have high LD with of the 
input SNP, we consider both resulting blocks. 

Given a block of SNPs identified by the Solid Spine by 
Expansion algorithm above, we then extract GWAS top 
ranking SNPs from the block, to identify if the candidate 
SNP is correlated with any associated SNPs. We consider 
a SNP to be top-ranking when its rank is less than a given 
threshold. 

We define three scores to assess the level of LD of the 
block defined by the candidate SNP and a top ranking 
SNP. The spine score is the mean of all LD values of 
the spine between the SNPs s x and s v : 
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Figure 3. Example of a linkage disequilibrium block. Given an input 
SNP, we compute its linkage disequilibrium block (delimited by dark 
lines), and then look for top ranking SNPs in the block (here a SNP 
ranking as 351). 



The triangle score is the mean of all LD values of the inner 
triangle between the SNPs s x and s y : 
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A block score is the sum of the spine score and the 
triangle score: 
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RESULTS 

We first use data from allelic imbalance sequencing (31) to 
test our SNP scoring method and to compare our method 
with existing ones. Then we use two different GWAS data 
sets to evaluate the mapping method. Finally, we show 
that the method can find known altered miRNA targets 
associated with disease. 
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Figure 4. Predicted SNP effects correspond with observed effects. 
Correlation between the measured allelic ratio AAR and (A) the differ- 
ence of our predicted allelic scores AS (with transformation), (B) MFE 
differences, and (C) TargetScan score differences (without transform- 
ation, but where the minimum TargetScan value represents the score 
for sequences without predicted target sites). See Table 1 for correl- 
ations and P-values. 



Scoring method predicts effects of mirSNPs 

Kim and Bartel (31) used allelic imbalance sequencing 
to measure for three miRNAs, in vivo miRNA-directed 
repression at polymorphic target sites in mice. They 
provide allelic ratios (target versus non-target allele) 

AR = Marget allele] f fi5 SN p s . 3 '_ UTRs tha t Create 

\non target allele] 

or disrupt miRNA target sites in tissues expressing 
(AR E ) and not expressing (AR NE ) the considered 
miRNA. We used 47 of these SNPs (those that have 
both allelic ratios AR E and AR NE ) to test our method. 
For each of these 47 SNPs, we computed miRNA regula- 
tion scores for the target allele S T and non-target allele 
S NT . We compared the difference of our scores between 
the two alleles AS = St — Snt with the difference of loga- 
rithms of allelic ratios AAR = \og 2 (AR NE ) — \og 2 (AR E ) 
(Figure 4) and found a clear and significant correlation 
(Pearson's correlation P-value 0.0025, Spearman's rank 
correlation P- value 0.00019). 

In comparison, using MFE given by RNAhybrid 2.1 
(32) to predict SNP effects gave insignificant correlations, 
whereas using TargetScan 5.0 context scores (13) (com- 
puted without taking conservation into account) gave 



significant but lower correlation (Table 1). Furthermore, 
our normalization method could improve the correlation 
based on TargetScan scores. 

This result suggests that our scoring method for SNP 
effects fits data from allelic imbalance sequencing better 
than TargetScan context scores (13) or changes in MFE 
[for example, used in (17)]. Our method therefore appears 
to be the best choice for predicting effects of SNPs in 
microRNA target sites. 

ANALYSIS OF GWAS DATA 

To generate a list of candidate SNPs involved in 
miRNA-based regulation, we computed differences of 
scores for all 3'-UTR haplotypes for all coding genes 
(UCSC RefSeq Genes hgl8) and all miRNAs (from 
miRBase 13.0). Specifically, we analysed mRNAs that 
had more than 1 haplotype in their 3'-UTR (12 808 of 
the 26 963 coding transcripts) according to the CEU popu- 
lation from HapMap. Of the 12 808*698 = 89 39 984 
mRNA/miRNA pairs, 396 851 had at least one haplotype 
score that differed from the other haplotype scores of the 
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Table 1. Correlations between the measured allelic ratio AAR 
and predicted SNP effects from several methods 



Method Pearson's corr. Spearman's corr. 
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SVM (raw scores) 
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MFE (no helix constraint) 


0.223 


0.1324 
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0.2345 


MFE (helix constraint 2-7) 


0.124 
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0.084 


0.5736 


TargetScan (raw scores) 


0.168 


0.2582 


0.394 


0.0062 


TargetScan (w/ transformation) 


0.299 


0.0409 


0.413 


0.0039 



same mRNA/miRNA pair. As explained in the methods, 
the haplotype score distribution has an exponential and a 
uniform part. Consequently, differences of scores also 
have a distribution with an exponential part, describing 
small differences in miRNA targeting. We used a thresh- 
old of 0.15 to filter out the exponential part. Of the 
396 851 mRNA/miRNA pairs (which correspond to 
401 983 AS values, as several mRNAs had several haplo- 
type score differences), 55 707 pairs (60 751 AS values) 
had at least one AS > 0.15. We selected the SNPs that 
generated a difference in score AS > 0.1 5 as candidate 
SNPs (18 325 SNPs). 

To further analyse the candidate mirSNPs, we mapped 
the mirSNPs to the breast cancer GWAS from CGEMS, 
as described in the methods. One would usually choose a 
high T threshold as parameter for the mapping method to 
identify blocks with high LD. We chose T = 0, however, 
to have data with low LD to analyse the block score vari- 
ation in relation to the SNP and GWAS scores, as the 
block scores quantify the link between the candidate 
mirSNPs and the GWAS SNPs. We computed block 
scores for each pair of candidate SNP and top ranking 
SNP detected by the mapping method. 

Top-ranking SNPs are likely in strong LD with their 
causative SNP. Consequently, we would expect that if 
mirSNPs are a significant factor behind the top-ranking 
CGEMS SNPs, high AS" scores would be enriched among 
the highest scoring blocks. Since a candidate SNP can 
have several corresponding AS due to several miRNAs 
and transcripts, we assigned to each SNP its maximum 
AS value: ASm- To test whether an increase in block 
score threshold between top-ranking SNPs and candidate 
SNPs causes any shift in the AS M distribution, we 
computed the probability density of AS M for different 
subsets of SNPs. These subsets were defined by a block 
score greater than a threshold, starting from all block 
scores and gradually reducing to only the best ones. 

Figure 5 shows for SNPs mapped to the 2112 top- 
ranking CGEMS SNPs, the distributions of AS M (from 
0.15 to 1) for several subsets of SNPs based on different 
block score thresholds. The distributions show a shift 
of the main peak at AS = 0.33 to AS = 0.53 as the 
block score threshold increases. This shift is consistent 
with mirSNPs being significant causative factors behind 
the top-ranking CGEMS SNPs. 




0.2 0.4 0.6 0.8 1.0 
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Figure 5. Distribution of mirSNP scores AS M for SNPs mapped to 
high-ranking SNPs from the CGEMS breast cancer GWAS. ASm is 
the maximum difference of scores for each SNP, where the scores are 
normalized scores from the SVM. Each curve shows the distribution for 
SNPs that have a block score greater than a given threshold. 'All' refers 
to AS M of all SNPs. '>0.9' refers to AS M of SNPs that have a block 
score >0.9 with one of the 2112 top-ranking CGEMS SNPs. The peak 
at 0.33 is decreasing as the block score threshold increases, whereas the 
peak at 0.53 is increasing with the block score threshold. 



We would also expect that the shift will be less 
pronounced if we consider more candidate SNPs (by 
using a higher rank threshold on GWAS SNPs), as these 
SNPs will likely have a higher proportion of false posi- 
tives. We therefore looked at different top-ranking thresh- 
olds to check that as the top-ranking threshold increases, 
the shift occurs later and later in terms of block score 
threshold. Figure 6A-D show 3D plots for top-ranking 
thresholds 528, 1056, 2112, and 4224. As in Figure 5, 
the plots show a shift of the main peak at AS = 0.33 to 
AS = 0.53 as the block score threshold increases. 

The lower part of the plots shows all AS M for all 
block scores — the background distribution of ASm 
scores without taking LD into account. Increasing the 
block score threshold removes mirSNPs that are not 
linked to breast cancer-associated GWAS marker SNPs, 
thereby increasing the proportion of candidate mirSNPs 
that are associated with breast cancer. The shift in AS M 
towards the right for high block score thresholds therefore 
shows that mirSNPs associated with breast cancer have a 
stronger effect on miRNA targeting than have the back- 
ground of all mirSNPs. 

As expected, increasing the threshold on top-ranking 
GWAS SNPs results in the shift occurring later and 
later on the v-axis. Using a higher top-ranking threshold 
gives a bigger proportion of false positive SNPs, whereas 
in contrast, a higher block score threshold gives a smaller 
proportion of false positives. Consequently, to compen- 
sate for the additional false positive SNPs that were 
added when increasing the rank threshold, a higher 
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Figure 6. Distributions of ASm for SNPs mapped to different numbers of high-ranking SNPs from the CGEMS breast cancer GWAS. 
The distributions vary with the number of candidate SNPs and block score thresholds. The graphs show AS M on the x-axis (range [0.15, 1]), 
complementary cumulative distribution of block scores (from all block scores on the bottom, to gradually filtering to the best block scores on the 
top) on the y-axis, and density of ASm for a given block score threshold (specifically, the distribution of ASm for SNPs that have a block score > the 
value on the v-axis) on the r-axis (in grayscale). Dark grey, light grey and white are respectively low, intermediate, and high-density values. Panels 
(A), (B), (C) and (D) show 3D plots for top-ranking thresholds 528, 1056, 2112 and 4224, respectively. The plots show a shift of the main peak at 
ASm — 0.33 to ASm — 0.53, as the block score threshold increases. 



block score threshold is needed to observe the shift in AS. 
These results indicate a link between high AS and 
high-block score top-ranking SNPs. Furthermore, the 
analyses give a good overview of how our predicted 
scores AS fit some GWAS data and show that our 
approach can identify SNPs in regulatory elements that 
may be causal in disease. 

Using TargetScan's context scores (13) computed for all 
3'-UTR haplotypes (without considering conservation), 
gave similar results indicating that the analysis is robust 
to the choice of prediction method (Supplementary 
Figures S3 and S4). 

We also repeated the analysis on a GWAS for 
Parkinson's disease. This analysis gave similar results, 
indicating that the method works with other data sets 
and diseases (Supplementary Figures S5 and S6). 
Finally, we analysed the significant trait-associated SNPs 
from the NHGRI GWAS Catalog (28) and found a 
similar shift in the AS distribution at very high-block 



scores between miRSNPs and associated SNPs from 
caucasian-based studies (Supplementary Figure S7; see 
Supplementary Table SI for the list of the best-scoring 
miRSNPs strongly linked to caucasian-based trait- 
associated SNPs). This result is consistent with us using 
Hapmap CEU haplotypes and linkage disequilibrium 
data for the analysis and indicates that miRSNPs 
explain some of the trait-associations in the NHGRI 
GWAS Catalog. 

Disease-related examples 

To further evaluate our methodology, we used it to 
analyse three miRNA/SNPs involved in breast cancer, 
asthma and Parkinson's disease. 

Saetrom et al. (33) found that the SNP rsl434536 lies in 
the target site of the microRNA miR-125b within the gene 
BMPRlb, and is associated with breast cancer. In that 
study, we used the disease mapping method presented 
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above to map the candidate SNP rs 1434536 to the breast 
cancer GWAS from CGEMS. We computed the LD block 
of rsl434536, in which we found 5 SNPs that rank within 
the 500 best in the association study (ranks 67, 79, 291, 
409 and 424) out of 528.000 SNPs; the candidate SNP lay 
in between the SNPs ranked 67 and 79 (Figure 7). The 
difference of scores for rsl434536 is 0.39. Saetrom et al. 
(33) verified that the SNP affects miR-125b's regulation of 
BMPRlb and verified the SNP's breast cancer association 
in an independent cohort. 

Tan et al. (34) found that the SNP rs 1063 320 is 
associated with asthma, depending on the mother's 
disease status. rsl063320 lies in the 3'-UTR of HLA-G, 
and the authors showed that this SNP affects miR-148a, 
miR-148b and miR-152 targeting of the HLA-G gene. 
They suggested that this altered miRNA targeting in- 
creases the risk of asthma. 

With our haplotype scoring method run genome-wide, 
we found 3 SNPs (rsl063320, rsl610696 and rsl707) 
in the 3'-UTR of HLA-G that can affect 28 miRNAs 
(data not shown). rsl063320 affects 10 miRNAs (data 
not shown), and its three largest differences of scores 
are given by the same three miRNAs reported by Tan 
et al. (34): 0.76, 0.78 and 0.81, respectively for 
miR-148b, miR-148a and miR-152. The other scores 
range from 0.33 to 0.55, indicating that the three 
miRNAs are clear candidates. 

Wang et al. (35) found that the SNP rs 12720208 is 
associated with Parkinson's disease, rs 12720208 lies in 
the 3'-UTR of FGF20. They also showed that this SNP 
has an effect on miR-433 targeting of FGF20. They sug- 
gested that this altered targeting increases the risk of 
Parkinson's disease. 

We identified two SNPs (rsl721100 and rsl2720208) in 
the 3'-UTR of FGF20 that can affect four miRNAs (data 
not shown). The largest difference of scores for this gene is 
0.88 and is given by miR-433 at rs 12720208— the same 
miRNA/SNP pair reported by Wang et al. (35). One 
other miRNA scores 0.44 with rsl2720208, whereas SNP 
rsl721 100 scores 0.24 and 0.43 with two miRNAs. 
Consequently, the pair rsl2720208/miR-433 seems to be 
a clear candidate. 

291 Input 409 




Figure 7. SNP rsl434536 (input) has an LD block (delimited by the 
dark lines) which contains top ranking SNPs (ranks 67, 79, 291, 409 
and 424) from CGEMS's breast cancer GWAS. 



DISCUSSION 

By evaluating our proposed method on allelic imbalance 
sequencing data, two different GWAS data sets, and 
validated mirSNPs, we have demonstrated that our 
method is useful for identifying potential causative SNPs 
in miRNA target sites. Specifically, our analyses of the 
allelic imbalance sequencing data show that our proposed 
method outperforms existing methods. Although the 
data set is limited as it contains only 47 SNPs, the data 
set should be of high quality as it was generated in vivo 
without artificially altering miRNA or target expression 
(31). Indeed, our results revealed clear differences 
between the methods. Especially, the method based on 
changes in predicted miRNA-mRNA hybridization 
MFE showed poor performance and could not predict 
the SNPs' effect on miRNA targeting. This result is con- 
sistent with overall miRNA-mRNA hybridization in itself 
being a poor predictor of miRNA targeting and support 
the model of target site context being essential for miRNA 
regulation (1). 

The basic approach used by many existing tools for 
detecting SNPs in miRNA target sites looks for SNPs 
in seed regions of predicted target sites. Seed regions 
are known to be the most important regions for miRNA 
targeting efficacy (1). Focusing on seed regions reduces 
the amount of false positive SNPs predicted to alter 
miRNA-targeting, but will miss SNPs affecting 
non-canonical miRNA targeting such as 3' supplementary 
sites. This basic method can however be used to filter the 
mRNA/miRNA pairs that are most likely affected by 
SNPs. Such filtered SNPs can then subsequently be 
analysed with our haplotype method. 

SNPs outside the seed region can affect miRNA target- 
ing, however, and some existing approaches based on 
computational RNA-RNA hybridization or thermo- 
dynamic calculations consider such SNPs. Our method 
can also detect SNPs in 3' supplementary sites, but accord- 
ing to our analyses, such SNPs have a small 
predicted effect (Supplementary Figure S8). This result 
is consistent with the observation that conserved 3' sup- 
plementary sites constitute 4.9% of all conserved pairing 
sites (36). As SNPs affecting seed site pairing have a bigger 
predicted effect than those affecting other miRNA 
features, our online database provide allelic sequences 
for SNPs in target seed sites. 

A transcriptome-wide study of interactions between 
miRNAs and mRNAs estimated that sites with seed 
mismatches constitute <6.6% of all miRNA target sites 
(19). By excluding SNPs in mismatch sites, we only miss 
SNPs that change a mismatch target site (weak) into 
another mismatch site. Moreover, non-canonical sites 
appear to have a smaller regulatory effect than canonical 
target sites have (19). Thus, our method focuses on iden- 
tifying the SNPs that are most likely to affect and to have 
the largest effect on miRNA targeting. 

Our haplotype scoring method is based on HapMap 
haplotype data, and only 66% of the SNPs from 
HapMap have haplotype data. The 34% HapMap SNPs 
that do not have haplotype data have a very low minimum 
allele frequency (MAF), usually 0 in the considered 
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hapmap population. Removing low MAF SNPs is an ad- 
vantage in mapping SNPs to common diseases, resulting 
in less false positives (false causal SNPs), in a common 
variant common disease model. 

Our haplotype approach also currently only focuses on 
analysing 3'-UTRs. Although miRNAs can target 
5'-UTRs and coding regions, these sites have a limited 
effect compared to 3'-UTR sites (19,20). 

The main advantage of our method compared to 
existing methods is that we analyse the regulatory effects 
of all linked genetic variations within regulatory regions, 
such as 3'-UTRs. Consequently, our method can be used 
to analyse how SNPs in multiple target sites together con- 
tribute to upregulate, downregulate, or compensate each 
other, through haplotype patterns. 

CONCLUSION 

We have presented a tool that aims at identifying 
the causative variation within regions associated with 
diseases. Specifically, the tool identifies 3'-UTR SNPs 
that can affect miRNA targeting and predicts the SNPs' 
effect on miRNA regulation. Our main result is the 
SNP effect prediction method. The results suggest that 
the effect predictions are reliable, compare favourably to 
existing methods, and can be used to filter and identify 
causative SNPs. 
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